[TRTLLM-6222][feat] Several perf opt for cuteDSL nvf4 gemm by liyuhannnnn · Pull Request #9428 · NVIDIA/TensorRT-LLM

liyuhannnnn · 2025-11-25T05:17:16Z

Summary by CodeRabbit

Dependencies
- Upgraded nvidia-cutlass-dsl from 4.3.0.dev0 to 4.3.0.
New Features
- Added prefetch support for dense blockscaled GEMM kernels with new configuration option.
- Expanded MMA tiler candidate configurations for improved kernel selection.
Performance
- Increased kernel compilation optimization level for better generated code efficiency.
Documentation
- Updated attribution references to reflect dependency version changes.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-11-25T05:20:41Z

📝 Walkthrough

Walkthrough

These changes move nvidia-cutlass-dsl from the pre-release version (4.3.0.dev0) to the stable release (4.3.0), add a use_prefetch parameter to the Blackwell dense blockscaled GEMM persistent kernel that enables conditional prefetch of tile data, expand the MMA tiler candidates explored during kernel selection, and raise the compiler optimization level.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Version Updates: `ATTRIBUTIONS-Python.md`, `requirements.txt` | Updated nvidia-cutlass-dsl version from 4.3.0.dev0 to 4.3.0 (released from pre-release) |
| Custom Ops Optimization: `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py` | Expanded mma_tiler_mn_candidates list with additional tiling configurations (256×64, 128×64, 256×192, 128×192); added compiler optimization flag (--opt-level 2) to GEMM invocation |
| Kernel Prefetch Feature: `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py` | Added use_prefetch parameter to Sm100BlockScaledPersistentDenseGemmKernel constructor and run() function; implemented conditional prefetch logic for tiles A, B, SFA, SFB with rolling prefetch support; added 192-wide N tiling support; adjusted memory and tile shape calculations; introduced min_blocks_per_mp for kernel launch; added --use_prefetch CLI flag |

Sequence Diagram

sequenceDiagram
    participant Client
    participant Kernel as Sm100BlockScaledPersistentDenseGemmKernel
    participant Memory as Shared Memory
    participant Compute as Compute Units

    Note over Client,Compute: With use_prefetch=False (Original)
    Client->>Kernel: Launch kernel (use_prefetch=False)
    Kernel->>Memory: Load current tile A, B, SFA, SFB
    Kernel->>Compute: Compute MMA operations
    Kernel->>Memory: Load next tile
    Kernel->>Compute: Compute next MMA operations

    Note over Client,Compute: With use_prefetch=True (New)
    Client->>Kernel: Launch kernel (use_prefetch=True)
    Kernel->>Memory: Prefetch current tile (A, B, SFA, SFB)
    par Overlapping
        Kernel->>Memory: Prefetch next k_BLOCK tiles (rolling)
        Kernel->>Compute: Compute current MMA operations
    end
    Kernel->>Compute: Compute remaining operations
    Note over Kernel: Overlapping prefetch<br/>with compute reduces<br/>memory latency stalls

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Key areas requiring attention:
- Prefetch conditional logic flow and interaction with overlapping_accum mode in dense_blockscaled_gemm_persistent.py (lines with new prefetch handling, tmem/tiler calculations, accumulator buffer release timing)
- Memory layout transformations for 192-wide N tiling branches and their correctness across A/B/SFA/SFB paths
- Synchronization semantics of min_blocks_per_mp and its impact on shared memory allocation and kernel occupancy
- Consistency of use_prefetch parameter threading through constructor, run() function, and CLI argument parsing
- Verification that expanded mma_tiler_mn candidates and --opt-level 2 don't introduce performance regressions or compilation failures

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete: it contains only the template structure without filling in required sections like Description, Test Coverage, or specific details about the changes. | Fill in the Description section explaining the cuteDSL 4.3.0 update and performance optimizations, and provide specific test cases in the Test Coverage section that validate the changes. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |
| Title check | ✅ Passed | The title directly reflects the main changes: updating cuteDSL to 4.3.0 and implementing performance optimizations for the NVF4 GEMM kernel, which align with the file-level summaries. |

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)
249-268: Remove unnecessary f-string prefix from options argument

The options keyword argument is properly supported in nvidia-cutlass-dsl==4.3.0. However, the string "--opt-level 2" should be passed as a plain string, not an f-string. Remove the f prefix:
                    swap_ab,
-                   options=f"--opt-level 2",
+                   options="--opt-level 2",
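
As a rough illustration of what the F541 hint checks, the sketch below walks a Python AST and reports f-strings that contain no placeholders. The helper name `placeholder_free_fstrings` and the sample snippet are made up for this example; this is not how Ruff itself is implemented.

```python
import ast


def placeholder_free_fstrings(source):
    """Return sorted line numbers of f-strings that have no {} placeholders."""
    tree = ast.parse(source)
    # Format specs such as ">10" are themselves JoinedStr nodes; skip them.
    spec_ids = {
        id(node.format_spec)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue) and node.format_spec is not None
    }
    flagged = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.JoinedStr) or id(node) in spec_ids:
            continue
        # Each placeholder in an f-string shows up as an ast.FormattedValue.
        if not any(isinstance(v, ast.FormattedValue) for v in node.values):
            flagged.append(node.lineno)
    return sorted(flagged)


# Hypothetical sample input: only the first line is a placeholder-free f-string.
SNIPPET = (
    'a = f"--opt-level 2"\n'
    'b = "--opt-level 2"\n'
    'c = f"--opt-level {2}"\n'
)
flagged_lines = placeholder_free_fstrings(SNIPPET)

# The f prefix changes nothing about the resulting value.
same_value = f"--opt-level 2" == "--opt-level 2"
```

Because the two spellings produce the same string, dropping the prefix is a style-only change.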

🧹 Nitpick comments (6)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (6)
64-64: Consider removing commented-out import or tracking via issue.

The commented-out import # from testing import benchmark and related commented code blocks (lines 2473-2481, 2558-2623) add noise. If this is intended for future CUPTI integration, consider removing and tracking via a GitHub issue instead of leaving commented code in the codebase.

104-130: Consider updating docstring to document use_prefetch parameter.

The new use_prefetch parameter is added to __init__ but the docstring (Args section) doesn't document it. Adding documentation would help users understand this optional performance optimization.

854-876: Consider adding a comment explaining the magic number 256.

The stride calculation (256 - self.num_sf_tmem_cols) * tCtAcc_fake.stride[0][1] uses 256 which appears to be related to SM100 tmem column capacity. A brief comment would improve maintainability.
+            # 256 is half of SM100_TMEM_CAPACITY_COLUMNS (512), representing the tmem columns per CTA pair
             tCtAcc_fake = cute.make_tensor(
                 tCtAcc_fake.iterator,
                 cute.make_layout(
                     tCtAcc_fake.shape,
                     stride = (
                         tCtAcc_fake.stride[0],
                         tCtAcc_fake.stride[1],
                         tCtAcc_fake.stride[2],
                         (256 - self.num_sf_tmem_cols) * tCtAcc_fake.stride[0][1]
                     ) 
                 )
             )
1898-1901: Use `not in` for membership test.

Per static analysis hint, use `not in` instead of `not ... in` for cleaner syntax.
         if not mma_tiler_mn[0] in [128, 256]:
             is_valid = False
-        if not mma_tiler_mn[1] in [64, 128, 192, 256]:
+        if mma_tiler_mn[1] not in [64, 128, 192, 256]:
             is_valid = False
Note: Line 1898 has the same pattern and could also be changed to if mma_tiler_mn[0] not in [128, 256]:.
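
A small sketch of why the rewrite is behavior-preserving: `not x in seq` parses as `not (x in seq)`, so it always agrees with `x not in seq`. The function name `is_valid_mma_tiler_mn` is hypothetical; the accepted sizes are the ones shown in the diff above.

```python
ALLOWED_M = (128, 256)
ALLOWED_N = (64, 128, 192, 256)


def is_valid_mma_tiler_mn(mma_tiler_mn):
    """Check an (M, N) tile against the sizes listed in the review diff."""
    is_valid = True
    if mma_tiler_mn[0] not in ALLOWED_M:
        is_valid = False
    if mma_tiler_mn[1] not in ALLOWED_N:
        is_valid = False
    return is_valid


# Both spellings agree for every value tried, including unsupported ones.
spellings_agree = all(
    (not n in ALLOWED_N) == (n not in ALLOWED_N)  # noqa: E713
    for n in (32, 64, 96, 128, 192, 256, 512)
)
```

For instance, `is_valid_mma_tiler_mn((128, 192))` is accepted while `is_valid_mma_tiler_mn((64, 128))` is rejected on the M dimension.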

2376-2377: Remove extraneous f prefix from string.

The f-string has no placeholders, so the f prefix is unnecessary.
-        options=f"--opt-level 2",
+        options="--opt-level 2",
2149-2181: Consider documenting use_prefetch parameter in docstring.

The run() function's docstring lists all other parameters but doesn't document the new use_prefetch parameter.

Add to the Args section:
        use_prefetch (bool, optional): If True, enables TMA prefetch for A, B,
            SFA, and SFB matrices. Defaults to False.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4049fc and 53745a7.

📒 Files selected for processing (4)

ATTRIBUTIONS-Python.md (1 hunks)
requirements.txt (1 hunks)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (2 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (29 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
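
The last two guidelines can be illustrated with a minimal sketch. The function `total_length` is a made-up example, not code from TensorRT-LLM: the `try` body holds only the call that may raise, the `except` clause names a single error type, and the normal path lives in `else`.

```python
def total_length(value):
    """Count the items in a sized container or a plain iterable.

    Args:
        value: Any object supporting len(), or any iterable.

    Returns:
        int: The number of items in value.
    """
    try:
        # Keep the try body to the one call that may raise.
        size = len(value)
    except TypeError:
        # Narrow except: len() raises TypeError for unsized objects.
        return sum(1 for _ in value)
    else:
        # Normal path for objects that do support len().
        return size
```

A list goes through the `else` branch, while a generator expression falls back to counting by iteration.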

Files:

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

🧠 Learnings (8)

📓 Common learnings

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

📚 Learning: 2025-08-14T21:04:50.248Z

Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py

📚 Learning: 2025-08-19T03:35:20.866Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

📚 Learning: 2025-08-21T21:48:35.135Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

ATTRIBUTIONS-Python.md
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

📚 Learning: 2025-11-14T11:22:03.729Z

Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

📚 Learning: 2025-08-09T20:57:04.084Z

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

📚 Learning: 2025-08-08T04:10:19.038Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

📚 Learning: 2025-10-17T13:21:31.724Z

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

requirements.txt

🧬 Code graph analysis (2)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (1)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/custom_pipeline.py (1)

consumer_release (217-229)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)

cpp/tests/unit_tests/kernels/fused_gated_gemm/gemmSwigluRunnerTest.cu (1)

options (192-192)

🪛 Ruff (0.14.5)

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py

1900-1900: Test for membership should be `not in`

Convert to `not in`

(E713)

2376-2376: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py

267-267: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (13)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)

86-96: Expanded MMA tiler candidates look reasonable

Adding (256, 64), (128, 64), (256, 192), and (128, 192) simply enlarges the search space, while kernel_class.can_implement(...) still filters out unsupported shapes. This should be safe and may improve autotuned performance on some matrices.
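
The filtering idea can be sketched as follows. Only the four tile shapes come from the review; `select_tiles` and `stub_can_implement` are hypothetical names, and the stub's rule is invented purely to show a candidate being rejected. It does not reflect the real `can_implement` logic.

```python
from typing import Callable, List, Tuple

Tile = Tuple[int, int]

# The four (M, N) shapes the review says were added to the candidate list.
NEW_TILES: List[Tile] = [(256, 64), (128, 64), (256, 192), (128, 192)]


def select_tiles(candidates: List[Tile],
                 can_implement: Callable[[Tile], bool]) -> List[Tile]:
    """Keep only the candidates the predicate accepts, preserving order."""
    return [tile for tile in candidates if can_implement(tile)]


def stub_can_implement(tile: Tile) -> bool:
    # Hypothetical stand-in: pretend this problem cannot use 192-wide N tiles.
    return tile[1] != 192


supported = select_tiles(NEW_TILES, stub_can_implement)
```

A larger candidate list only widens what the tuner may try; anything the predicate rejects never reaches kernel selection.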

requirements.txt (1)

72-72: nvidia-cutlass-dsl 4.3.0 version bump verified

PyPI has nvidia-cutlass-dsl version 4.3.0 published with prebuilt wheels for CPython 3.10–3.13 on manylinux x86_64 and aarch64. The package supports Python 3.10+ and requires CUDA Toolkit (minimum 11.4, compatible with versions 11.4–12.x), so the version bump from 4.3.0.dev0 to 4.3.0 is ready to merge.

ATTRIBUTIONS-Python.md (1)

25253-25253: Version update is consistent with PR objectives.

The nvidia-cutlass-dsl version bump from 4.2.1 to 4.3.0 aligns with the PR goal to update cuteDSL to the stable 4.3.0 release.

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/dense_blockscaled_gemm_persistent.py (10)

220-254: LGTM!

The new cta_tile_shape_mnk_sfb and epi_tile_n attributes are correctly computed and provide the necessary values for the new overlapping accumulator and 192-wide tiling support.

304-315: LGTM - complex tmem allocation logic for overlapping accumulator mode.

The calculations for num_accumulator_tmem_cols and iter_acc_early_release_in_epilogue correctly implement the overlapping accumulator optimization. The formula cta_tile_shape_mnk[1] * 2 - num_sf_tmem_cols ensures proper tmem layout when overlapping is enabled.
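
As a worked example of the quoted formula, the sketch below evaluates `cta_tile_shape_mnk[1] * 2 - num_sf_tmem_cols` for one input. The value of 32 scale-factor columns is an assumption chosen for illustration, the 512-column capacity is taken from the constant name quoted elsewhere in this review, and the capacity check is an assumed sanity condition rather than something the kernel is known to assert.

```python
# Capacity figure as named in the review's suggested comment.
SM100_TMEM_CAPACITY_COLUMNS = 512


def overlapping_accumulator_tmem_cols(cta_tile_n, num_sf_tmem_cols):
    """Evaluate cta_tile_n * 2 - num_sf_tmem_cols, the formula quoted above."""
    return cta_tile_n * 2 - num_sf_tmem_cols


# Illustrative inputs only: 32 scale-factor columns is an assumed value.
acc_cols = overlapping_accumulator_tmem_cols(256, 32)

# Assumed sanity check: accumulator plus scale-factor columns fit in tmem.
fits_in_tmem = acc_cols + 32 <= SM100_TMEM_CAPACITY_COLUMNS
```

With these inputs the accumulator region takes 480 columns, which together with the 32 assumed scale-factor columns fills the 512-column budget exactly.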

458-481: LGTM - 192-wide N tiling support.

The layout transformation for tma_tensor_sfb when cta_tile_shape_mnk[1] == 192 correctly handles the non-power-of-2 case with appropriate shape and stride adjustments. The comment explaining the right multiplication for ScaledBasis is helpful.

590-592: LGTM!

Adding min_blocks_per_mp=1 is appropriate for persistent kernel scheduling.

931-1019: LGTM - TMA prefetch implementation.

The prefetch logic correctly implements:

- Initial prefetch of up to prefetch_dist blocks before the main loop
- Rolling prefetch of future tiles within the main loop

The conditional k_block < k_block_cnt - prefetch_dist properly prevents out-of-bounds prefetch at the end of the K dimension.
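
The schedule described above can be simulated on the host with a short sketch. It models only the ordering of events, not TMA or the kernel itself; the function name `prefetch_schedule` and the placement of the rolling prefetch right after each load are assumptions.

```python
def prefetch_schedule(k_block_cnt, prefetch_dist):
    """Return the ordered ("prefetch", k) and ("load", k) events for one tile."""
    events = []
    # Initial prefetch: up to prefetch_dist blocks before the main loop.
    for k_block in range(min(prefetch_dist, k_block_cnt)):
        events.append(("prefetch", k_block))
    for k_block in range(k_block_cnt):
        events.append(("load", k_block))
        # Rolling prefetch, guarded so it never reaches past the last K block.
        if k_block < k_block_cnt - prefetch_dist:
            events.append(("prefetch", k_block + prefetch_dist))
    return events


events = prefetch_schedule(k_block_cnt=4, prefetch_dist=2)
prefetched = sorted(k for kind, k in events if kind == "prefetch")
```

With four K blocks and a distance of two, every block is prefetched exactly once and no prefetch index reaches 4, which is the out-of-bounds case the guard is meant to exclude.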

1066-1091: LGTM!

The tmem pointer offsets for SFA and SFB correctly use the new num_accumulator_tmem_cols and num_sfa_tmem_cols values, ensuring proper memory layout for both overlapping and non-overlapping modes.

1121-1129: LGTM!

The acc_stage_index calculation correctly uses phase XOR 1 for overlapping accumulator mode (double-buffering) versus direct index for non-overlapping mode.
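
A tiny sketch of the selection described here, assuming the phase is a 0/1 bit that flips per tile; the function signature is hypothetical and only mirrors the wording of the comment.

```python
def acc_stage_index(index, phase, overlapping_accum):
    """Pick the accumulator stage: phase XOR 1 when overlapping, else index."""
    return phase ^ 1 if overlapping_accum else index


# With overlapping enabled, an alternating phase bit selects alternating stages.
overlapping_stages = [acc_stage_index(0, phase, True) for phase in (0, 1, 0, 1)]

# Without overlapping, the pipeline index is used unchanged.
direct_stages = [acc_stage_index(index, 0, False) for index in (0, 1)]
```

XOR with 1 simply flips the low bit, so the stage in use is always the one opposite the current phase.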

1145-1162: LGTM!

The SFB tmem pointer offset handling correctly addresses the different memory access patterns for 192-wide (non-power-of-2) and 64-wide N tiling configurations.

1354-1451: LGTM - Epilogue overlapping accumulator handling.

The epilogue correctly implements:

- Reversed subtile iteration based on accumulator buffer phase
- Early accumulator buffer release after iter_acc_early_release_in_epilogue iterations
- Proper fencing (fence_view_async_tmem_load) before early release
- Consistent use of real_subtile_idx in TMA store

This enables better pipelining by releasing the accumulator buffer before the full epilogue completes.
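
The control flow in the list above might look roughly like the host-side sketch below. Everything here is an assumption for illustration: the function name `epilogue_plan`, the step labels, the exact comparison that triggers the early release, and which phase selects the reversed order are not shown in the review.

```python
def epilogue_plan(subtile_cnt, reverse, early_release_iter):
    """List the epilogue steps for one tile as (step_name, subtile) tuples."""
    steps = []
    for subtile_idx in range(subtile_cnt):
        # Reversed iteration walks the subtiles from last to first.
        real_subtile_idx = (
            subtile_cnt - 1 - subtile_idx if reverse else subtile_idx
        )
        steps.append(("tmem_load", real_subtile_idx))
        if subtile_idx == early_release_iter:
            # Fence first so pending tmem loads finish before the release.
            steps.append(("fence", real_subtile_idx))
            steps.append(("release_accumulator", real_subtile_idx))
        # The store uses the same real_subtile_idx as the load.
        steps.append(("tma_store", real_subtile_idx))
    return steps


plan = epilogue_plan(subtile_cnt=4, reverse=True, early_release_iter=1)
store_order = [idx for name, idx in plan if name == "tma_store"]
release_pos = [i for i, (name, _) in enumerate(plan)
               if name == "release_accumulator"]
```

In this model the accumulator is handed back after the second subtile is loaded, while the remaining two subtiles are still being stored.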

2140-2141: LGTM!

The use_prefetch parameter is properly wired through the CLI argument parser, run() function, print statements, and kernel construction. The default value of False maintains backwards compatibility.

Also applies to: 2529-2534, 2593-2593
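
A minimal sketch of how such a flag is commonly wired, assuming a `store_true` argparse option; the real argument definition in the PR may differ, and `run` here is a stand-in that only reports the selected mode.

```python
import argparse


def run(use_prefetch=False):
    """Stand-in for the example's run(); reports which mode was selected.

    Args:
        use_prefetch (bool, optional): If True, enables TMA prefetch for A, B,
            SFA, and SFB matrices. Defaults to False.
    """
    return "prefetch" if use_prefetch else "no-prefetch"


def parse_args(argv):
    """Parse the command line; --use_prefetch is off unless given."""
    parser = argparse.ArgumentParser(description="Prefetch flag wiring sketch")
    parser.add_argument(
        "--use_prefetch",
        action="store_true",
        default=False,
        help="Enable prefetch (assumed flag shape)",
    )
    return parser.parse_args(argv)


default_mode = run(parse_args([]).use_prefetch)
enabled_mode = run(parse_args(["--use_prefetch"]).use_prefetch)
```

Keeping the default at False means existing callers and command lines behave as before unless they opt in.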

liyuhannnnn · 2025-11-28T07:57:01Z

/bot -h

github-actions · 2025-11-28T07:57:11Z